EM Algorithms for Weighted-Data Clustering
with Application to Audio-Visual Scene Analysis 

Israel D. Gebru, Xavier Alameda-Pineda, Florence Forbes and Radu Horaud 


Abstract —Data clustering has received a lot of attention and 
numerous methods, algorithms and software packages are available.
Among these techniques, parametric finite-mixture models
play a central role due to their interesting mathematical properties
and to the existence of maximum-likelihood estimators based
on expectation-maximization (EM). In this paper we propose a
new mixture model that associates a weight with each observed
point. We introduce the weighted-data Gaussian mixture and we
derive two EM algorithms. The first one considers a fixed weight
for each observation. The second one treats each weight as a
random variable following a gamma distribution. We propose
a model selection method based on a minimum message length
criterion, provide a weight initialization strategy, and validate
the proposed algorithms by comparing them with several
state-of-the-art parametric and non-parametric clustering techniques.
We also demonstrate the effectiveness and robustness of the
proposed clustering technique in the presence of heterogeneous
data, namely audio-visual scene analysis.

Index Terms —finite mixtures, expectation-maximization, 
weighted-data clustering, robust clustering, outlier detection, 
model selection, minimum message length, audio-visual fusion, 
speaker localization. 


I. Introduction 

Finding significant groups in a set of data points is a 
central problem in many fields. Consequently, clustering has 
received a lot of attention, and many methods, algorithms 
and software packages are available today. Among these 
techniques, parametric finite mixture models play a paramount 
role, due to their interesting mathematical properties as well as 
to the existence of maximum likelihood estimators based on 
expectation-maximization (EM) algorithms. While the finite 
Gaussian mixture (GMM) [1] is the model of choice, it is
extremely sensitive to the presence of outliers. Alternative
robust models have been proposed in the statistical literature,
such as mixtures of t-distributions [2] and their numerous
variants, e.g. [?], [?], [?], [?], [?], [?]. In contrast to the
Gaussian case, no closed-form solution exists for the
t-distribution and tractability is maintained via the use of EM
and a Gaussian scale mixture representation,
$\mathcal{T}(x \mid \mu, \Sigma, \alpha) = \int_0^\infty \mathcal{N}(x \mid \mu, \Sigma/w)\, \mathcal{G}(w; \alpha/2, \alpha/2)\, dw$,
where $x$ is an observed vector, $\mathcal{N}$ is the multivariate Gaussian distribution with
mean $\mu$ and covariance $\Sigma/w$, and $\mathcal{G}$ is the gamma distribution
of a univariate positive variable $w$ parameterized by $\alpha$. In the
case of mixtures of t-distributions, with mixing coefficients $\pi_k$,
$\sum_{k=1}^{K} \pi_k \mathcal{T}(x \mid \mu_k, \Sigma_k, \alpha_k)$, a latent variable $w$ can also
be introduced. Its distribution is a mixture of $K$ gamma
distributions that accounts for the component-dependent $\alpha_k$
[?]. Clustering is then usually performed by associating a positive
variable $w_i$, distributed as $w$, with each observed point $x_i$.
The distributions of both $w_i$ and $x_i$ do not depend on $i$.
The observed data are drawn from i.i.d. variables, distributed
according to the t-mixture, or one of its variants [?], [?], [?],
[?], [?], [?], [?].

In this paper we propose a finite mixture model in which the
variable $w_i$ is used as a weight to account for the reliability of
the observed $x_i$, independently of its assigned cluster.
The distribution of $w_i$ is no longer a gamma mixture but has
to depend on $i$, so that each data point can potentially be treated
differently. In contrast to mixtures of t-distributions, it follows
that the observed data are independent but not identically
distributed. We introduce the weighted-data Gaussian mixture
model (WD-GMM). We distinguish two cases, namely (i) the
weights are known a priori and hence they are fixed, and
(ii) the weights are modeled as random variables and hence they
are iteratively updated, given initial estimates. We show that
in the case of fixed weights, the GMM parameters can be
estimated via an extension of the standard EM which will be
referred to as the fixed weighted-data EM algorithm (FWD-EM).
Then we consider the more general case of weights that
are treated as random variables. We model these variables
with gamma distributions (one distribution for each variable)
and we formally derive a closed-form EM algorithm which
will be referred to as the weighted-data EM algorithm (WD-EM).
While the M-step of the latter is similar to the M-step
of FWD-EM, the E-step is considerably different as both the
posterior probabilities (responsibilities) and the parameters of
the posterior gamma distributions (the weights) are updated
(E-Z-step and E-W-step). The responsibilities are computed
using the Pearson type VII distribution (the reader is referred
to [?] for a recent discussion regarding this distribution),
also called the Arellano-Valle and Bolfarine generalized
t-distribution [?], and the parameters of the posterior gamma
distributions are computed from the prior gamma parameters
and from the Mahalanobis distance between the data and the
mixture means. Note that the weights play a different role
than the responsibilities. Unlike the responsibilities, which
are probabilities, the weights are random variables that can
take arbitrary positive values. Their posterior means can be
used as an absolute measure of the relevance of the data.
Typically, an outlying data point which is far from any cluster
center will have a small weight while it may still be assigned
with a significant responsibility value to the closest cluster.
Responsibilities indicate which cluster center is the closest
but not whether any of them is close at all.

The idea of weighted-data clustering has already been proposed
in the framework of non-parametric clustering methods
such as K-means and spectral clustering, e.g. [?], [?], [?],
[?]. These methods generally propose to incorporate prior
information in the clustering process in order to prevent
atypical data (outliers) from contaminating the clusters. The idea
of modeling data weights as random variables and of estimating
them via EM was proposed in [14] in the particular framework
of Markovian brain image segmentation. In [14] it is shown
that specific expert knowledge is not needed and that the
data-weight distribution guides the model towards a satisfactory
segmentation. A variational EM is proposed in [14] as their
formulation has no closed form. In this paper we build on
this idea: instead of relying on prior information about
atypical data, e.g. [10], [?], [?], [?], we devise a novel EM
algorithm that updates the weight distributions. The proposed
method belongs to the robust clustering category of mixture
models because observed data that are far away from the
cluster centers have little influence on the estimation of the
means and covariances.

An important feature of mixture-based clustering methods
is to perform model selection on the premise that the number
of components K in the mixture corresponds to the number of
clusters in the data. Traditionally, model selection is performed
by obtaining a set of candidate models for a range of values of
K (assuming that the true value is in this range). The number
of components is selected by minimizing a model selection
criterion, such as the Bayesian information criterion (BIC), minimum
message length (MML), or Akaike's information criterion (AIC), to
cite just a few [?], [?]. The disadvantage of these methods is
twofold. Firstly, a whole set of candidates has to be obtained
and problems associated with running EM many times may
emerge. Secondly, they provide a number of components that
optimally approximates the density and not the true number of
clusters present in the data. More recently, there seems to be
a consensus among mixture model practitioners that a
well-founded and computationally efficient model selection strategy
is to start with a large number of components and to merge
them [?]. [18] proposes a practical algorithm that starts with a
very large number of components (thus making the algorithm
robust to initialization), iteratively annihilates components,
redistributes the observations to the other components, and
terminates based on the MML criterion. [?] starts with an
overestimated number of components using BIC, and then
merges them hierarchically according to an entropy criterion.
More recently [?] proposes a similar method that merges
components based on measuring their pair-wise overlap.

Another trend in handling the issue of finding the proper
number of components is to consider Bayesian non-parametric
mixture models. This allows the implementation of mixture
models with an infinite number of components via the use
of Dirichlet process mixture models. In [19], [20] an infinite
Gaussian mixture (IGMM) is presented with a computationally
intensive Markov chain Monte Carlo implementation. At first
glance, IGMM may appear similar to FWD-EM. However,
these two algorithms are quite different. While IGMM is fully
Bayesian, the proposed FWD-EM is not, in the sense that no
priors are assumed on the parameters, typically the means and
covariance matrices. IGMM implies Student predictive
distributions while FWD-EM involves only Gaussian distributions.

More recently, more flexibility in the cluster shapes has been
allowed by considering an infinite mixture of infinite Gaussian
mixtures (I²GMM) [21]. The flexibility is however limited
to a cluster composed of sub-clusters of identical shapes
and orientations, which may alter the performance of this
approach. Altogether, IGMM and I²GMM are not designed
to handle outliers, as illustrated in Section VIII, Figs. [?]-f and
[?]-g. Infinite Student mixture models have also been considered
[22], but inference requires a variational Bayes approximation
which generates additional computational complexity.


Bayesian non-parametrics, although promising techniques, 
require a fully Bayesian setting. The latter, however, induces 
additional complexity for handling priors and hyper-priors, 
especially in a multi-variate context. In contrast, our latent 
variable approach allows exact inference. With respect to 
model selection, we therefore propose to extend the method of 
[18] to weighted-data mixtures. We formally derive an MML
criterion for the weighted-data mixture model and we plug 
this criterion into an efficient algorithm which, starting with 
a large number of components, simultaneously estimates the 
model parameters, the posterior probabilities of the weights 
and the optimal number of components. 


We also propose to apply the proposed weighted-data robust 
clustering method to the problem of fusing auditory and visual 
information. This problem arises when the task is, e.g. to 
detect a person that is both seen and heard, such as an active 
speaker. Single-modality signals - vision-only or audio-only 
- are often either weak or ambiguous, and it may be useful 
to combine information from different sensors, e.g. cameras 
and microphones. There are several difficulties associated 
with audio-visual fusion from a data clustering perspective: 
the two sensorial modalities (i) live in different spaces, 
(ii) are contaminated by different types of noise with different 
distributions, (iii) have different spatiotemporal distributions, 
and (iv) are perturbed by different physical phenomena, e.g. 
acoustic reverberations, lighting conditions, etc. For example,
a speaker may face the camera while he/she is silent and may 
emit speech while he/she turns his/her face away from the 
camera. Speech signals have sparse spectro-temporal structure 
and they are mixed with other sound sources, such as music 
or background noise. Speaker faces may be totally or partially 
occluded, in which case face detection and localization are
extremely unreliable. We show that the proposed method is
well suited to finding audio-visual clusters and to discriminating
between speaking and silent people.

The remainder of this paper is organized as follows. Section II
outlines the weighted-data mixture model; Section III
sketches the FWD-EM algorithm. Weights modeled with random
variables are introduced in Section IV and the WD-EM is
described in detail in Section V. Section VI details how to deal
with an unknown number of clusters and Section VII addresses
the issue of algorithm initialization. In Section VIII the proposed
algorithms are tested and compared with several other
parametric and non-parametric clustering methods. Section IX
addresses clustering of audio-visual data. Section X concludes
the paper. Additional results and videos are available online.¹


II. Gaussian Mixture with Weighted Data 

In this section, we present the intuition and the formal
definition of the proposed weighted-data model. Let $x \in \mathbb{R}^d$ be
a random vector following a multivariate Gaussian distribution
with mean $\mu \in \mathbb{R}^d$ and covariance $\Sigma \in \mathbb{R}^{d \times d}$, namely
$p(x;\theta) = \mathcal{N}(x;\mu,\Sigma)$, with the notation $\theta = \{\mu,\Sigma\}$. Let
$w > 0$ be a weight indicating the relevance of the observation
$x$. Intuitively, the higher the weight $w$, the stronger the impact of $x$.
The weight can therefore be incorporated into the model by
"observing $x$ $w$ times". In terms of the likelihood function, this
is equivalent to raising $p(x;\theta)$ to the power $w$, i.e. $\mathcal{N}(x;\mu,\Sigma)^w$.
However, the latter is not a probability distribution since it
does not integrate to one. It is straightforward to notice that
$\mathcal{N}(x;\mu,\Sigma)^w \propto \mathcal{N}(x;\mu,\Sigma/w)$. Therefore, $w$ plays the role of
the precision and is different for each datum $x$. Subsequently,
we write:

$$\hat{p}(x;\theta,w) = \mathcal{N}\left(x;\mu,\tfrac{1}{w}\Sigma\right), \tag{1}$$

from which we derive a mixture model with K components:

$$\tilde{p}(x;\Theta,w) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}\left(x;\mu_k,\tfrac{1}{w}\Sigma_k\right), \tag{2}$$

where $\Theta = \{\pi_1,\dots,\pi_K,\theta_1,\dots,\theta_K\}$ are the mixture
parameters, $\pi_1,\dots,\pi_K$ are the mixture coefficients satisfying
$\pi_k > 0$ and $\sum_{k=1}^{K}\pi_k = 1$, $\theta_k = \{\mu_k,\Sigma_k\}$ are the parameters
of the k-th component and K is the number of components.
We will refer to the model in (2) as the weighted-data
Gaussian mixture model (WD-GMM). Let $X = \{x_1,\dots,x_n\}$
be the observed data and $W = \{w_1,\dots,w_n\}$ be the weights
associated with $X$. We assume each $x_i$ is independently drawn
from (2) with $w = w_i$. The observed-data log-likelihood is:

$$\ln \tilde{p}(X;\Theta,W) = \sum_{i=1}^{n} \ln\left(\sum_{k=1}^{K} \pi_k\, \mathcal{N}\left(x_i;\mu_k,\tfrac{1}{w_i}\Sigma_k\right)\right). \tag{3}$$

It is well known that direct maximization of the log-likelihood
function is problematic in the case of mixtures and that the
expected complete-data log-likelihood must be considered
instead. Hence, we introduce a set of n hidden (assignment)
variables $Z = \{z_1,\dots,z_n\}$ associated with the observed
variables $X$ and such that $z_i = k$, $k \in \{1,\dots,K\}$, if and only
if $x_i$ is generated by the k-th component of the mixture. In the
following we first consider a fixed (given) number of mixture
components K; we then extend the model to an unknown K,
thus estimating the number of components from the data.
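
The proportionality between the powered Gaussian and the Gaussian with scaled covariance, which motivates (1), can be checked by writing both densities out. The following is a short worked derivation under the notation of this section (with $d$ the dimension of $x$), added as an illustration rather than taken from the original text:

```latex
\begin{aligned}
\mathcal{N}(x;\mu,\Sigma)^{w}
  &= (2\pi)^{-wd/2}\,|\Sigma|^{-w/2}
     \exp\!\Big(-\tfrac{w}{2}\,(x-\mu)^{\top}\Sigma^{-1}(x-\mu)\Big),\\
\mathcal{N}\big(x;\mu,\tfrac{1}{w}\Sigma\big)
  &= (2\pi)^{-d/2}\,w^{d/2}\,|\Sigma|^{-1/2}
     \exp\!\Big(-\tfrac{w}{2}\,(x-\mu)^{\top}\Sigma^{-1}(x-\mu)\Big),
\end{aligned}
```

The second line uses $|\Sigma/w| = w^{-d}|\Sigma|$ and $(\Sigma/w)^{-1} = w\,\Sigma^{-1}$. Both expressions share the same exponential factor and differ only by a factor that does not depend on $x$, which is why the powered Gaussian can be replaced by a properly normalized Gaussian with precision scaled by $w$.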

¹ https://team.inria.fr/perception/research/wdgmm/


III. EM with Fixed Weights

The simplest case is when the weight values are provided
at algorithm initialization, either using some prior knowledge
or estimated from the observations (e.g. Section VII), and are
then kept fixed while alternating between the expectation and
maximization steps. In this case, the expected complete-data
log-likelihood is:

$$\mathcal{Q}_c(\Theta,\Theta^{(r)}) = \mathrm{E}_{P(Z|X;W,\Theta^{(r)})}\left[\ln P(X,Z;W,\Theta)\right], \tag{4}$$

where $\mathrm{E}_P[\cdot]$ denotes the expectation with respect to the
distribution $P$. The (r+1)-th EM iteration consists of two steps,
namely the evaluation of the posterior distribution given the
current model parameters $\Theta^{(r)}$ and the weights $W$ (E-step),
and the maximization of (4) with respect to $\Theta$ (M-step):

$$\Theta^{(r+1)} = \arg\max_{\Theta}\, \mathcal{Q}_c(\Theta,\Theta^{(r)}). \tag{5}$$

It is straightforward to show that this yields the following
FWD-EM algorithm:


A. The E-Step 

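The posteriors $\eta_{ik}^{(r+1)} = p(z_i = k \mid x_i; w_i, \Theta^{(r)})$ are updated
with:

$$\eta_{ik}^{(r+1)} = \frac{\pi_k^{(r)}\, \hat{p}(x_i;\theta_k^{(r)},w_i)}{\tilde{p}(x_i;\Theta^{(r)},w_i)}, \tag{6}$$

where $\hat{p}$ and $\tilde{p}$ are defined in (1) and (2).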


B. The M-Step 

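Expanding (4) we get:

$$\begin{aligned}
\mathcal{Q}_c(\Theta,\Theta^{(r)}) &= \sum_{i=1}^{n}\sum_{k=1}^{K} \eta_{ik}^{(r+1)} \ln\left(\pi_k\, \mathcal{N}\left(x_i;\mu_k,\tfrac{1}{w_i}\Sigma_k\right)\right) \\
&\doteq \sum_{i=1}^{n}\sum_{k=1}^{K} \eta_{ik}^{(r+1)} \left(\ln \pi_k - \ln|\Sigma_k|^{1/2} - \frac{w_i}{2}(x_i-\mu_k)^{\top}\Sigma_k^{-1}(x_i-\mu_k)\right),
\end{aligned} \tag{7}$$

where $\doteq$ denotes equality up to a constant that does not depend
on $\Theta$. By canceling out the derivatives with respect to the
model parameters, we obtain the following update formulae
for the mixture proportions, means, and covariance matrices:

$$\pi_k^{(r+1)} = \frac{1}{n}\sum_{i=1}^{n} \eta_{ik}^{(r+1)}, \tag{8}$$

$$\mu_k^{(r+1)} = \frac{\sum_{i=1}^{n} w_i\, \eta_{ik}^{(r+1)}\, x_i}{\sum_{i=1}^{n} w_i\, \eta_{ik}^{(r+1)}}, \tag{9}$$

$$\Sigma_k^{(r+1)} = \frac{\sum_{i=1}^{n} w_i\, \eta_{ik}^{(r+1)} \left(x_i-\mu_k^{(r+1)}\right)\left(x_i-\mu_k^{(r+1)}\right)^{\top}}{\sum_{i=1}^{n} \eta_{ik}^{(r+1)}}. \tag{10}$$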
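
To make the fixed-weight iteration concrete, here is a minimal sketch of one FWD-EM step for scalar data (d = 1), written with the Python standard library only. It assumes the E-step takes the form "responsibility proportional to $\pi_k\,\mathcal{N}(x_i;\mu_k,\sigma_k^2/w_i)$" and that the M-step uses weighted sums for the mean and variance, normalized by $\sum_i w_i\eta_{ik}$ and $\sum_i \eta_{ik}$ respectively. The function names, the toy data and the weight values are hypothetical and chosen only for illustration.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def fwd_em_step(xs, ws, pis, mus, variances):
    """One FWD-EM iteration for scalar data with fixed weights ws (illustrative)."""
    n, K = len(xs), len(pis)
    # E-step: component k sees x_i with variance variances[k] / w_i.
    eta = []
    for x, w in zip(xs, ws):
        joint = [pis[k] * normal_pdf(x, mus[k], variances[k] / w) for k in range(K)]
        total = sum(joint)
        eta.append([j / total for j in joint])
    # M-step: proportions, weighted means and weighted variances.
    new_pis, new_mus, new_vars = [], [], []
    for k in range(K):
        resp = sum(eta[i][k] for i in range(n))
        wresp = sum(ws[i] * eta[i][k] for i in range(n))
        mu = sum(ws[i] * eta[i][k] * xs[i] for i in range(n)) / wresp
        var = sum(ws[i] * eta[i][k] * (xs[i] - mu) ** 2 for i in range(n)) / resp
        new_pis.append(resp / n)
        new_mus.append(mu)
        new_vars.append(var)
    return eta, new_pis, new_mus, new_vars

# Toy data: two groups around -2 and +2, plus one far-away point.
xs = [-2.1, -1.9, -2.0, 2.0, 2.2, 1.8, 9.0]
ws = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.01]  # hypothetical fixed weights
eta, pis, mus, variances = fwd_em_step(xs, ws, [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
```

Because the far-away point carries a small weight, it contributes very little to the weighted sums, so the updated mean of the second component stays close to the group around +2.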
IV. Modeling the Weights

As we already remarked, the weights play the role of
precisions. The notable difference between standard finite
mixture models and the proposed model is that there is a
different weight $w_i$, hence a different precision, associated
with each observation $x_i$. Within a Bayesian formalism, the
weights $W$ may be treated as random variables, rather than
being fixed in advance, as in the previous case. Since (1) is
a Gaussian, a convenient choice for the prior on $w$, $P(w)$, is
the conjugate prior of the precision with known mean, i.e. a
gamma distribution. This ensures that the weight posteriors
are gamma distributions as well. Summarizing we have:

$$P(w;\phi) = \mathcal{G}(w;\alpha,\beta) = \Gamma(\alpha)^{-1}\,\beta^{\alpha}\, w^{\alpha-1} e^{-\beta w}, \tag{11}$$

where $\mathcal{G}(w;\alpha,\beta)$ is the gamma distribution,
$\Gamma(\alpha) = \int_0^{\infty} t^{\alpha-1} e^{-t}\, dt$ is the gamma function, and $\phi = \{\alpha,\beta\}$ are
the parameters of the prior distribution of $w$. The mean and
variance of the random variable $w$ are given by:

$$\mathrm{E}[w] = \alpha/\beta, \tag{12}$$

$$\mathrm{var}[w] = \alpha/\beta^{2}. \tag{13}$$

V. EM with Random Weights

In this section we derive the WD-EM algorithm associated
with a model in which the weights are treated as random
variables following (11). The gamma distribution of each $w_i$
is assumed to be parameterized by $\phi_i = \{\alpha_i,\beta_i\}$. Within this
framework, the expectation of the complete-data log-likelihood
is computed over both the assignment and weight variables:

$$\mathcal{Q}_R(\Theta,\Theta^{(r)}) = \mathrm{E}_{P(Z,W|X;\Theta^{(r)},\Phi)}\left[\ln P(Z,W,X;\Theta,\Phi)\right], \tag{14}$$

where we used the notation $\Phi = \{\phi_1,\dots,\phi_n\}$. We notice
that the posterior distribution factorizes on $i$:

$$P(Z,W \mid X;\Theta^{(r)},\Phi) = \prod_{i=1}^{n} P(z_i,w_i \mid x_i;\Theta^{(r)},\phi_i),$$

and each one of these factors can be decomposed as:

$$P(z_i,w_i \mid x_i;\Theta^{(r)},\phi_i) = P(w_i \mid z_i,x_i;\Theta^{(r)},\phi_i)\, P(z_i \mid x_i;\Theta^{(r)},\phi_i), \tag{15}$$

where the two quantities on the right-hand side of this equation
have closed-form expressions. The computation of each one of
these two expressions leads to two sequential steps, the E-W-step
and the E-Z-step, of the expectation step of the proposed
algorithm.


A. The E-Z Step 

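The marginal posterior distribution of $z_i$ is obtained by
integrating (15) over $w_i$. As previously, we denote the
responsibilities with $\eta_{ik}^{(r+1)} = P(z_i = k \mid x_i;\Theta^{(r)},\phi_i)$. The
integration computes:

$$\begin{aligned}
\eta_{ik}^{(r+1)} &\propto \int \pi_k^{(r)}\, P(x_i \mid z_i = k, w_i;\Theta^{(r)})\, P(w_i;\phi_i)\, dw_i \\
&= \int \pi_k^{(r)}\, \mathcal{N}\left(x_i;\mu_k^{(r)},\Sigma_k^{(r)}/w_i\right) \mathcal{G}(w_i;\alpha_i,\beta_i)\, dw_i \\
&\propto \pi_k^{(r)}\, \mathcal{P}\left(x_i;\mu_k^{(r)},\Sigma_k^{(r)},\alpha_i,\beta_i\right),
\end{aligned} \tag{16}$$

where $\mathcal{P}(x_i;\mu_k,\Sigma_k,\alpha_i,\beta_i)$ denotes the Pearson type VII
probability distribution function, which can be seen as a
generalization of the t-distribution:

$$\mathcal{P}(x;\mu,\Sigma,\alpha,\beta) = \frac{\Gamma(\alpha + d/2)}{|\Sigma|^{1/2}\,\Gamma(\alpha)\,(2\pi\beta)^{d/2}} \left(1 + \frac{\|x-\mu\|_{\Sigma}^{2}}{2\beta}\right)^{-(\alpha + d/2)}, \tag{17}$$

with $\|x-\mu\|_{\Sigma}^{2} = (x-\mu)^{\top}\Sigma^{-1}(x-\mu)$ the squared Mahalanobis distance.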
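
The closed form of the E-Z step can be sanity-checked numerically. The sketch below, restricted to scalar data (d = 1) and using only the standard library, compares a Pearson type VII density of the form $\Gamma(\alpha+d/2)\,/\,\big(|\Sigma|^{1/2}\Gamma(\alpha)(2\pi\beta)^{d/2}\big)\,\big(1+\|x-\mu\|^2_\Sigma/(2\beta)\big)^{-(\alpha+d/2)}$ with a midpoint-rule approximation of the integral of a Gaussian with variance $\sigma^2/w$ against a gamma density with shape $\alpha$ and rate $\beta$. The function names, the integration bounds and the test values are assumptions made for this illustration.

```python
import math

def pearson7_pdf(x, mu, var, alpha, beta):
    """Univariate (d = 1) Pearson type VII density in closed form."""
    d = 1
    maha = (x - mu) ** 2 / var  # squared Mahalanobis distance in 1-D
    log_norm = (math.lgamma(alpha + d / 2) - math.lgamma(alpha)
                - 0.5 * math.log(var) - (d / 2) * math.log(2 * math.pi * beta))
    return math.exp(log_norm - (alpha + d / 2) * math.log1p(maha / (2 * beta)))

def scale_mixture_pdf(x, mu, var, alpha, beta, upper=60.0, steps=20000):
    """Midpoint-rule approximation of the integral of N(x; mu, var/w) G(w; alpha, beta) over w."""
    h = upper / steps
    total = 0.0
    for j in range(steps):
        w = (j + 0.5) * h
        gauss = math.sqrt(w / (2 * math.pi * var)) * math.exp(-0.5 * w * (x - mu) ** 2 / var)
        log_gamma_pdf = (alpha * math.log(beta) - math.lgamma(alpha)
                         + (alpha - 1) * math.log(w) - beta * w)
        total += gauss * math.exp(log_gamma_pdf) * h
    return total

closed = pearson7_pdf(1.0, 0.0, 2.0, 3.0, 2.0)
numeric = scale_mixture_pdf(1.0, 0.0, 2.0, 3.0, 2.0)
```

The two values should agree up to the quadrature error. This suggests that, in this scalar setting, marginalizing the weight out of the Gaussian gives the Pearson type VII density exactly, so the last proportionality in the E-Z step only hides the normalization over $k$.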


B. The E-W Step 


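The posterior distribution of $w_i$, namely $P(w_i \mid z_i = k, x_i;\Theta^{(r)},\phi_i)$,
is a gamma distribution, because the gamma distribution is the
conjugate prior of the precision of the Gaussian distribution.
Therefore, we only need to compute the parameters of the
posterior gamma distribution:

$$\begin{aligned}
P(w_i \mid z_i = k, x_i;\Theta^{(r)},\phi_i) &\propto P(x_i \mid z_i = k, w_i;\Theta^{(r)})\, P(w_i;\phi_i) \\
&= \mathcal{N}\left(x_i;\mu_k^{(r)},\Sigma_k^{(r)}/w_i\right) \mathcal{G}(w_i;\alpha_i,\beta_i) \\
&\propto \mathcal{G}\left(w_i; a_i^{(r+1)}, b_{ik}^{(r+1)}\right),
\end{aligned} \tag{18}$$

where the parameters of the posterior gamma distribution are
evaluated with:

$$a_i^{(r+1)} = \alpha_i + \frac{d}{2}, \tag{19}$$

$$b_{ik}^{(r+1)} = \beta_i + \frac{1}{2}\left\|x_i - \mu_k^{(r)}\right\|_{\Sigma_k^{(r)}}^{2}. \tag{20}$$

The conditional mean of $w_i$, namely $\bar{w}_{ik}^{(r+1)}$, can then be
evaluated with:

$$\bar{w}_{ik}^{(r+1)} = \mathrm{E}\left[w_i \mid z_i = k, x_i;\Theta^{(r)},\phi_i\right] = \frac{a_i^{(r+1)}}{b_{ik}^{(r+1)}}. \tag{21}$$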


While estimating the weights themselves is not needed by
the algorithm, it is useful to evaluate them in order to fully
characterize the observations and to discriminate between
inliers and outliers. First notice that the marginal posterior
distribution of $w_i$ is a mixture of gamma distributions:

$$P(w_i \mid x_i;\Theta^{(r)},\phi_i) = \sum_{k=1}^{K} \mathcal{G}\left(w_i; a_i^{(r+1)}, b_{ik}^{(r+1)}\right) \eta_{ik}^{(r+1)}, \tag{22}$$

and therefore the posterior mean of $w_i$ is evaluated with:

$$\bar{w}_i^{(r+1)} = \mathrm{E}\left[w_i \mid x_i;\Theta^{(r)},\phi_i\right] = \sum_{k=1}^{K} \eta_{ik}^{(r+1)}\, \bar{w}_{ik}^{(r+1)}. \tag{23}$$

By inspection of (19), (20), and (21) it is easily seen
that the value of $\bar{w}_i$ decreases as the distance between the
cluster centers and observation $x_i$ increases. Importantly, the
evaluation of $\bar{w}_i$ enables outlier detection. Indeed, an outlier
is expected to be far from all the clusters, and therefore all
$\bar{w}_{ik}$ will be small, leading to a small value of $\bar{w}_i$. It is worth
noticing that this is not possible using only the responsibilities
$\eta_{ik}$, since they are normalized by definition, and therefore their
value is not an absolute measure of the datum's relevance, but
only a relative measure of it.
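
The outlier-detection argument can be illustrated with a small sketch of the E-W step for scalar data (d = 1). It assumes the posterior gamma parameters are the prior shape plus d/2 and the prior rate plus half the squared Mahalanobis distance, and that the posterior mean weight is their ratio, averaged over components with the responsibilities. The function name, the toy values and the hard responsibilities passed in are hypothetical.

```python
def posterior_weights(xs, alphas, betas, mus, variances, eta):
    """E-W step for scalar data: per-component and marginal posterior mean weights."""
    d = 1
    w_bar_ik, w_bar_i = [], []
    for i, x in enumerate(xs):
        a = alphas[i] + d / 2.0  # posterior shape
        row = []
        for k in range(len(mus)):
            b = betas[i] + 0.5 * (x - mus[k]) ** 2 / variances[k]  # posterior rate
            row.append(a / b)  # posterior mean of w_i given z_i = k
        w_bar_ik.append(row)
        # Marginal posterior mean: responsibilities times conditional means.
        w_bar_i.append(sum(e * w for e, w in zip(eta[i], row)))
    return w_bar_ik, w_bar_i

xs = [0.1, 5.2, 30.0]            # the last point is far from both centers
mus, variances = [0.0, 5.0], [1.0, 1.0]
alphas, betas = [1.0] * 3, [1.0] * 3
eta = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]  # hypothetical responsibilities
w_bar_ik, w_bar_i = posterior_weights(xs, alphas, betas, mus, variances, eta)
```

In this toy setting the third point is fully assigned to the second component, yet its posterior mean weight is orders of magnitude smaller than those of the two inliers, which is the behaviour described above.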

C. The Maximization Step

This step maximizes the expected complete-data log-likelihood
over the mixture parameters. By expanding (14),
we have:

$$\begin{aligned}
\mathcal{Q}_R(\Theta,\Theta^{(r)}) &\doteq \sum_{i=1}^{n}\sum_{k=1}^{K} \eta_{ik}^{(r+1)} \int P(w_i \mid x_i, z_i = k;\Theta^{(r)},\phi_i)\, \ln\left(\pi_k\, \mathcal{N}\left(x_i;\mu_k,\tfrac{1}{w_i}\Sigma_k\right)\right) dw_i \\
&\doteq \sum_{i=1}^{n}\sum_{k=1}^{K} \eta_{ik}^{(r+1)} \left(\ln \pi_k - \ln|\Sigma_k|^{1/2} - \frac{\bar{w}_{ik}^{(r+1)}}{2}(x_i-\mu_k)^{\top}\Sigma_k^{-1}(x_i-\mu_k)\right).
\end{aligned} \tag{24}$$

The parameter updates are obtained from canceling out the
derivatives of the expected complete-data log-likelihood (24).
As with standard Gaussian mixtures, all the updates are
closed-form expressions:

$$\pi_k^{(r+1)} = \frac{1}{n}\sum_{i=1}^{n} \eta_{ik}^{(r+1)}, \tag{25}$$

$$\mu_k^{(r+1)} = \frac{\sum_{i=1}^{n} \bar{w}_{ik}^{(r+1)}\, \eta_{ik}^{(r+1)}\, x_i}{\sum_{i=1}^{n} \bar{w}_{ik}^{(r+1)}\, \eta_{ik}^{(r+1)}}, \tag{26}$$

$$\Sigma_k^{(r+1)} = \frac{\sum_{i=1}^{n} \eta_{ik}^{(r+1)}\, \bar{w}_{ik}^{(r+1)} \left(x_i-\mu_k^{(r+1)}\right)\left(x_i-\mu_k^{(r+1)}\right)^{\top}}{\sum_{i=1}^{n} \eta_{ik}^{(r+1)}}. \tag{27}$$

It is worth noticing that the M-step of the WD-EM algorithm is
very similar to the M-step of the FWD-EM algorithm (Section III).
Indeed, the above iterative formulas, (25), (26), (27),
are identical to the formulas (8), (9), (10), except that the
fixed weights $w_i$ are here replaced with the posterior means
of the random weights, $\bar{w}_{ik}^{(r+1)}$.


VI. Estimating the Number of Components

So far it has been assumed that the number of mixture
components K is provided in advance. This assumption is
unrealistic for most real-world applications. In this section we
propose to extend the method and algorithm proposed in [18]
to the weighted-data clustering model. An interesting feature
of this model selection method is that it does not require
parameter estimation for many different values of K, as
would be the case with the Bayesian information criterion
(BIC) [23]. Instead, the algorithm starts with a large number of
components and iteratively deletes components as they become
irrelevant. Starting with a large number of components has
the additional advantage of making the algorithm robust to
initialization. Formally, the parameter estimation problem is
cast into a transmission encoding problem and the criterion
is to minimize the expected length of the message to be
transmitted:

$$\text{length}(X,\Theta) = \text{length}(\Theta) + \text{length}(X \mid \Theta). \tag{28}$$


In this context, the observations and the parameters have to
be quantized to finite precision before the transmission. This
quantization sets a trade-off between the two terms of the
previous equation. Indeed, when truncating to high precision,
length($\Theta$) may be long, but length($X|\Theta$) will be short, since
the parameters fit the data well. Conversely, if the quantization
is coarse, length($\Theta$) may be short, but length($X|\Theta$) will
be long. The optimal quantization step can be found by
means of the Taylor approximation [?]. In that case, the
optimization problem corresponding to the minimum message
length (MML) criterion is:

$$\Theta_{\text{MML}} = \arg\min_{\Theta} \left\{ -\log P(\Theta) - \log P(X \mid \Theta,\Phi) + \frac{1}{2}\log|I(\Theta)| + \frac{\mathcal{D}(\Theta)}{2}\left(1 + \log\frac{1}{12}\right) \right\}, \tag{29}$$

where $I(\Theta) = -\mathrm{E}\{D_{\Theta}^{2} \log P(X \mid \Theta)\}$ is the expected Fisher
information matrix (FIM) and $\mathcal{D}(\Theta)$ denotes the dimensionality
of the model, namely the dimension of the parameter
vector $\Theta$. Since the minimization (29) does not depend on
the weight parameters, $\Phi$ will be omitted for simplicity.


In our particular case, as in the general case of mixtures,
the Fisher information matrix cannot be obtained analytically.
Indeed, the direct optimization of the log-likelihood does not
lead to closed-form solutions. Nevertheless, it was noticed that
the complete FIM upper bounds the FIM [?], and that the
expected complete-data log-likelihood lower bounds the
log-likelihood. This allows us to write the following equivalent
optimization problem:

$$\Theta_{\text{MML}} = \arg\min_{\Theta} \left\{ -\log P(\Theta) - \log \mathcal{Q}_R(\Theta,\Theta^{(r)}) + \frac{1}{2}\log|I_c(\Theta)| + \frac{\mathcal{D}(\Theta)}{2}\left(1 + \log\frac{1}{12}\right) \right\}, \tag{30}$$

where $I_c$ denotes the expected complete-FIM and $\mathcal{Q}_R$ is
evaluated with (24).

As already mentioned, because there is a different weight $w_i$
for each observation $i$, the observed data are not identically
distributed and our model cannot be considered a classical
mixture model. For this reason, the algorithm proposed in
[18] cannot be applied directly to our model. Indeed, in the
proposed WD-GMM setting, the complete-FIM is:

$$I_c(\Theta) = \text{diag}\left(\pi_1 \sum_{i=1}^{n} I_i(\theta_1), \dots, \pi_K \sum_{i=1}^{n} I_i(\theta_K),\; n\mathbf{M}\right), \tag{31}$$

where $I_i(\theta_k) = -\mathrm{E}\{D_{\theta_k}^{2} \log \mathcal{P}(x_i \mid \theta_k,\alpha_i,\beta_i)\}$ is the Fisher
information matrix for the i-th observation with respect to
the parameter vector $\theta_k$ (mean and covariance) of the
k-th component, $\mathcal{P}$ is defined in (17), and $\mathbf{M}$ is the Fisher
information matrix of the multinomial distribution, namely the
diagonal matrix $\text{diag}(\pi_1^{-1},\dots,\pi_K^{-1})$. We can evaluate $|I_c(\Theta)|$
from (31):

$$|I_c(\Theta)| = n^{K(M+1)}\, |\mathbf{M}| \prod_{k=1}^{K} \pi_k^{M} \left|\frac{1}{n}\sum_{i=1}^{n} I_i(\theta_k)\right|, \tag{32}$$

where $M$ denotes the number of free parameters of each
component. For example, $M = 2d$ when using diagonal
covariance matrices or $M = d(d+3)/2$ when using full
covariance matrices.


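Importantly, one of the main advantages of the methodology
proposed in [18] is that one has complete freedom to choose
a prior distribution on the parameters, $P(\Theta)$. In our case,
inspired by (32), we select the following prior distributions
for the parameters:

$$P(\theta_k) \propto \left|\frac{1}{n}\sum_{i=1}^{n} I_i(\theta_k)\right|^{\frac{1}{2}}, \tag{33}$$

$$P(\pi_1,\dots,\pi_K) \propto |\mathbf{M}|^{\frac{1}{2}}. \tag{34}$$

By substitution of (32), (33) and (34) into (30) we obtain the following
optimization problem:

$$\Theta_{\text{MML}} = \arg\min_{\Theta} \left\{ \frac{M}{2}\sum_{k=1}^{K} \log \pi_k - \log \mathcal{Q}_R(\Theta,\Theta^{(r)}) + \frac{K(M+1)}{2}\left(1 + \log\frac{n}{12}\right) \right\}, \tag{35}$$

where we used $\mathcal{D}(\Theta) = K(M+1)$.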
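
The substitution step is short enough to write out. The following worked derivation is a sketch added for illustration. It assumes that the determinant of the complete-FIM factorizes as $n^{K(M+1)}|\mathbf{M}|\prod_k \pi_k^{M}\,|\bar{I}_k|$ with $\bar{I}_k = \frac{1}{n}\sum_i I_i(\theta_k)$, that the priors are proportional to $|\bar{I}_k|^{1/2}$ and $|\mathbf{M}|^{1/2}$, and that the quantization constant is $1/12$ as in the standard MML derivation:

```latex
\begin{aligned}
-\log P(\Theta) &\doteq -\tfrac{1}{2}\sum_{k=1}^{K}\log|\bar{I}_k| - \tfrac{1}{2}\log|\mathbf{M}|,\\
\tfrac{1}{2}\log|I_c(\Theta)| &= \tfrac{K(M+1)}{2}\log n + \tfrac{1}{2}\log|\mathbf{M}|
   + \tfrac{M}{2}\sum_{k=1}^{K}\log\pi_k + \tfrac{1}{2}\sum_{k=1}^{K}\log|\bar{I}_k|,\\
-\log P(\Theta) + \tfrac{1}{2}\log|I_c(\Theta)| &\doteq \tfrac{M}{2}\sum_{k=1}^{K}\log\pi_k + \tfrac{K(M+1)}{2}\log n .
\end{aligned}
```

Under these assumptions the Fisher-information terms cancel between the prior and the determinant. Adding the quantization term with dimensionality $K(M+1)$ then gives $\frac{K(M+1)}{2}\big(1+\log\frac{n}{12}\big)$, which, together with the $\log\pi_k$ sum and the data term, is the objective stated above.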


One may notice that (35) does not make sense (diverges) if
any of the $\pi_k$'s is allowed to be null. However, in the current
length coding framework, there is no point in transmitting the
parameters of an empty component. Therefore, we only focus
on the non-empty components, namely those components for
which $\pi_k > 0$. Let $\mathcal{K}_+$ denote the index set of non-empty
components and let $K_+ = |\mathcal{K}_+|$ be its cardinality. We can
rewrite (35) as:

$$\Theta_{\text{MML}} = \arg\min_{\Theta} \left\{ \frac{M}{2}\sum_{k \in \mathcal{K}_+} \log \pi_k - \log \mathcal{Q}_R(\Theta,\Theta^{(r)}) + \frac{K_+(M+1)}{2}\left(1 + \log\frac{n}{12}\right) \right\}. \tag{36}$$
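
As a rough illustration of how the restricted objective could be evaluated, the sketch below computes a message length from the mixture proportions, skipping empty components. It assumes the objective has three terms: $(M/2)\sum_{k\in\mathcal{K}_+}\log\pi_k$, minus a data term supplied by the caller, plus $\frac{K_+(M+1)}{2}\big(1+\log\frac{n}{12}\big)$. The function name and the `data_term` argument are hypothetical, and the constant 12 is taken from the standard MML derivation, not from a legible part of the text.

```python
import math

def message_length(pis, data_term, n, M):
    """Message length restricted to non-empty components (pi_k > 0)."""
    active = [p for p in pis if p > 0.0]
    k_plus = len(active)
    return (0.5 * M * sum(math.log(p) for p in active)   # proportion term
            - data_term                                   # data-fit term, given as a number
            + 0.5 * k_plus * (M + 1) * (1.0 + math.log(n / 12.0)))

# Hypothetical values: three components, one of them empty, n = 120 points, M = 4.
length_with_empty = message_length([0.5, 0.5, 0.0], 0.0, 120, 4)
length_without = message_length([0.5, 0.5], 0.0, 120, 4)
```

An empty component contributes nothing to either sum, so both calls return the same value. This mirrors the remark that there is no point in transmitting the parameters of an empty component.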


The above minimization problem can be solved by modifying
the EM algorithm described in Section V (notice that
there is an equivalent derivation for the fixed-weight EM
algorithm described in Section III). Indeed, we remark that the
minimization (36) is equivalent to using a symmetric improper
Dirichlet prior for the proportions with exponent $-M/2$.
Moreover, since the optimization function for the parameters
of the Gaussian components is the same (equivalently, we used
a flat prior for the mean vector and covariance matrix), their
estimation formulas (26) and (27) still hold. Therefore, we
only need to modify the estimation of the mixture proportions,
namely:

$$\pi_k^{(r+1)} = \frac{\max\left\{0,\; \sum_{i=1}^{n} \eta_{ik}^{(r+1)} - \frac{M}{2}\right\}}{\sum_{k'=1}^{K} \max\left\{0,\; \sum_{i=1}^{n} \eta_{ik'}^{(r+1)} - \frac{M}{2}\right\}}. \tag{37}$$

The max operator in (37) verifies whether the k-th component
is supported by the data. When one of the components
becomes too weak, i.e. the required minimum support $M/2$
cannot be obtained from the data, this component is annihilated.
In other words, its parameters will not be estimated, since
there is no need to transmit them. One has to be careful in
this context, since starting with a large value of K may lead
to several empty components. In order to avoid this singular
situation, we adopt the component-wise EM procedure (CEM)
[?], as proposed in [18] as well. Intuitively, we run both
E and M steps for one component, before moving to the
next component. More precisely, after running the E-Z and
E-W steps for the component $k$, its parameters are updated if
$k \in \mathcal{K}_+$; otherwise, if $k \notin \mathcal{K}_+$, the component is annihilated.
The rationale behind this procedure is that, when a component
is annihilated, its probability mass is immediately redistributed
among the remaining components. Summarizing, CEM
updates the components one by one, whereas the classical EM
simultaneously updates all the components.
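
The annihilation mechanism can be sketched in a few lines. The function below applies the thresholded proportion update to a table of responsibilities: each component's total responsibility is reduced by the minimum support M/2, clipped at zero, and the results are renormalized. The function name and the toy responsibilities are hypothetical, and the sketch updates all components at once rather than component-wise as in CEM.

```python
def update_proportions(eta, M):
    """Proportion update with the minimum-support threshold M/2 (illustrative)."""
    K = len(eta[0])
    # Support of each component after subtracting M/2, clipped at zero.
    support = [max(0.0, sum(row[k] for row in eta) - M / 2.0) for k in range(K)]
    total = sum(support)
    if total == 0.0:
        raise ValueError("all components were annihilated")
    return [s / total for s in support]

# Six scalar observations, three components, M = 2 (one mean and one variance).
eta = [
    [1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.5, 0.5],
]
pis = update_proportions(eta, M=2)
```

Here the third component gathers a total responsibility of 0.5, which is below the required support of 1, so its proportion becomes zero and its mass is shared between the two remaining components.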


The proposed algorithm is outlined in Algorithm 1. In
practice, an upper and a lower number of components, $K_{\text{high}}$
and $K_{\text{low}}$, are provided. Each iteration $r$ of the algorithm
consists of component-wise E and M steps. If needed, some
of the components are annihilated, and the parameters are
updated accordingly, until the relative length difference is
below a threshold, $|\Delta\text{LEN}_{\text{MML}}^{(r)}| < \varepsilon$. In that case, if the
message length, i.e. (36), is lower than the current optimum,
the parameters, weights, and length are saved in $\Theta_{\min}$, $W_{\min}$
and $\text{LEN}_{\min}$ respectively. In order to explore the full range
of K, the least populated component is artificially annihilated,
and CEM is run again. The complexity of Algorithm 1 is
similar to the complexity of the algorithm in [18], with the
exception of the E-W step. However, the most computationally
intensive part of this step (matrix inversion and matrix-vector
multiplications in (20)) is already achieved in the E-Z step.


Algorithm 1: WD-EM with model selection based on the MML criterion.

input : $X = \{x_i\}_{i=1}^{n}$, $K_{\text{low}}$, $K_{\text{high}}$, $\Theta^{(0)}$, $\Phi^{(0)}$
output: The minimum length mixture model: $\Theta_{\min}$ and
        the final data weights: $W_{\min}$

Set: $r = 0$, $\mathcal{K}_+ = \{1,\dots,K_{\text{high}}\}$, $\text{LEN}_{\min} = +\infty$
while $|\mathcal{K}_+| > K_{\text{low}}$ do
    repeat
        for $k = 1$ to $K_{\text{high}}$ do
            E-Z step: evaluate $\eta_{ik}^{(r+1)}$ using (16)
            E-W step: evaluate the posterior gamma parameters using (19)-(20)
                      and $\bar{w}_{ik}^{(r+1)}$ using (21)
            M-step: evaluate $\pi_k^{(r+1)}$ using (37)
            if $\pi_k^{(r+1)} > 0$ then
                Evaluate mean using (26) and covariance using (27)
            else
                $K_+ = K_+ - 1$
            end
        end
        $\Theta^{(r+1)} \leftarrow \{\pi_k^{(r+1)}, \mu_k^{(r+1)}, \Sigma_k^{(r+1)}\}_{k=1}^{K_{\text{high}}}$
        Compute optimal length $\text{LEN}_{\text{MML}}^{(r+1)}$ with (36)
        $r \leftarrow r + 1$
    until $|\Delta\text{LEN}_{\text{MML}}^{(r)}| < \varepsilon$
    if $\text{LEN}_{\text{MML}}^{(r)} < \text{LEN}_{\min}$ then
        $\text{LEN}_{\min} = \text{LEN}_{\text{MML}}^{(r)}$
        $\Theta_{\min} \leftarrow \Theta^{(r)}$
        $W_{\min} = \{\bar{w}_i\}_{i=1}^{n}$ with $\bar{w}_i = \sum_{k=1}^{K_{\text{high}}} \eta_{ik}\,\bar{w}_{ik}$
    end
    $k^{*} = \arg\min_{k \in \mathcal{K}_+} \pi_k^{(r)}$, $\mathcal{K}_+ = \mathcal{K}_+ \setminus \{k^{*}\}$
end


VII. Algorithm Initialization

The EM algorithms proposed in Section III, Section V and
Section VI require proper initialization of both the weights
(one for each observation and either a fixed value $w_i$ or
parameters $\alpha_i, \beta_i$) and of the model parameters. The K-means
algorithm is used for an initial clustering, from which
values for the model parameters are computed. In this section
we concentrate on the issue of weight initialization. An
interesting feature of our method is that the only constraint
on the weights is that they must be positive. Initial $w_i$
values may depend on expert or prior knowledge and may be
experiment- or goal-dependent. This model flexibility allows
the incorporation of such prior knowledge. In the absence of
any prior information/knowledge, we propose a data-driven
initialization scheme and make the assumption that densely
sampled regions are more important than sparsely sampled
ones. We note that a similar strategy could be used if one
wants to reduce the importance of dense data and to give more
importance to small groups of data or to sparse data.

We adopt a well-known data similarity measure based on the
Gaussian kernel, and it follows that the weight $w_i$ associated
with the data point $i$ is evaluated with:

$$w_i = \sum_{j \in \mathcal{S}_i^q} \exp\left(-\frac{d^2(x_i, x_j)}{\sigma}\right) \qquad (38)$$

where $d(x_i, x_j)$ is the Euclidean distance, $\mathcal{S}_i^q$ denotes the set
containing the $q$ nearest neighbors of $x_i$, and $\sigma$ is a positive
scalar. In all the experiments we used $q = 20$ for the simulated
datasets and $q = 50$ for the real datasets. In both cases, we
used $\sigma = 100$. In the case of the FWD-EM algorithm, the
weights $w_i$ thus initialized remain unchanged. However, in
the case of the WD-EM algorithm, the weights are modeled
as latent random variables drawn from a gamma distribution,
hence one needs to set initial values for the parameters of this
distribution, namely $\alpha_i$ and $\beta_i$. Using the expressions of the
mean and variance of the gamma distribution, one can choose to
initialize these parameters as $\alpha_i = w_i^2$
and $\beta_i = w_i$, such that the mean and variance of the prior
distribution are $w_i$ and 1, respectively.
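The initialization above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names `init_weights` and `init_gamma_prior`, the brute-force neighbor search, and the toy points are assumptions made for the example, and the gamma distribution is taken in its shape/rate form (mean $\alpha/\beta$, variance $\alpha/\beta^2$).

```python
import math

def init_weights(points, q=20, sigma=100.0):
    """Eq. (38): w_i = sum over the q nearest neighbors of exp(-d^2 / sigma)."""
    weights = []
    for i, xi in enumerate(points):
        # Squared Euclidean distances from x_i to every other point, sorted.
        sq_dists = sorted(
            sum((a - b) ** 2 for a, b in zip(xi, xj))
            for j, xj in enumerate(points) if j != i
        )
        # Keep only the q nearest neighbors.
        weights.append(sum(math.exp(-d2 / sigma) for d2 in sq_dists[:q]))
    return weights

def init_gamma_prior(w):
    """alpha = w^2 and beta = w give a prior with mean w and variance 1."""
    return w * w, w

# Toy data (hypothetical): four points in a dense region and one isolated point.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (30.0, 30.0)]
w = init_weights(points, q=3, sigma=100.0)
alpha, beta = init_gamma_prior(w[0])
```

Points in the dense region receive a weight close to `q`, while the isolated point receives a weight close to zero, which is the intended effect of favoring densely sampled regions.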


VIII. Experimental Validation 

The proposed algorithms were tested and evaluated using
eight datasets: four simulated datasets and four publicly
available datasets that are widely used for benchmarking
clustering methods. The main characteristics of these datasets
are summarized in Table I. The simulated datasets (SIM) are
designed to evaluate the robustness of the proposed method
with respect to outliers. The simulated inliers are drawn from
Gaussian mixtures while the simulated outliers are drawn from
a uniform distribution, e.g. Fig. 1. The SIM datasets have
different cluster configurations in terms of separability, shape
and compactness. The eight datasets that we used are the
following:

• SIM-Easy: Five clusters that are well separated and 
compact. 

• SIM-Unbalanced: Four clusters of different size and 
density. 

• SIM-Overlapped: Four clusters, two of them overlap. 

• SIM-Mixed: Six clusters of different size, compactness 
and shape. 

• MNIST contains instances of handwritten digit images
normalized to the same size [25]. We preprocessed these
data with PCA to reduce the dimension from 784 to 141,
by keeping 95% of the variance.


TABLE I: Datasets used for benchmarking and their characteristics:
n is the number of data points, d is the dimension of
the data space, and K is the number of clusters.

Data Set                 n       d    K
SIM-Easy                 600     2    5
SIM-Unbalanced           600     2    4
SIM-Overlapped           600     2    4
SIM-Mixed                600     2    6
MNIST [25]               10,000  141  10
Wav [26]                 5,000   21   3
BCW [27]                 569     30   2
Letter Recognition [28]  20,000  16   26

































(a) SIM-Easy (b) SIM-Unbalanced (c) SIM-Overlapped (d) SIM-Mixed 


Fig. 1: Samples of the SIM dataset with no outliers (top row) and contaminated with 50% outliers (bottom row). The 600 
inliers are generated from Gaussian mixtures while the 300 outliers are generated from a uniform distribution. 


• Wav is the Waveform Database Generator [26].

• BCW refers to the Breast Cancer Wisconsin data set [27],
in which each instance represents a digitized image of a
fine needle aspirate (FNA) of a breast mass.

• Letter Recognition contains 20,000 single-letter images
that were generated by randomly distorting the images
of the 26 uppercase letters from 20 different commercial
fonts [28]. Each letter/image is described by 16 features.
This dataset is available through the UCI machine learning
repository.
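The construction of the simulated datasets described above (Gaussian-mixture inliers contaminated with uniformly distributed outliers) can be sketched as follows. The cluster centers, the isotropic unit spread, the equal mixing proportions and the bounding box are assumptions made for this illustration; the actual SIM configurations are not specified in the text.

```python
import random

def simulate(centers, n_inliers, outlier_ratio, spread=1.0,
             box=(-20.0, 20.0), seed=0):
    """Draw 2D inliers from an isotropic Gaussian mixture and outliers
    from a uniform distribution over a square box."""
    rng = random.Random(seed)
    inliers = []
    for _ in range(n_inliers):
        cx, cy = rng.choice(centers)  # equal mixing proportions (assumption)
        inliers.append((rng.gauss(cx, spread), rng.gauss(cy, spread)))
    # The outlier percentage is expressed relative to the number of inliers.
    n_outliers = round(outlier_ratio * n_inliers)
    lo, hi = box
    outliers = [(rng.uniform(lo, hi), rng.uniform(lo, hi))
                for _ in range(n_outliers)]
    return inliers, outliers

# Five well-separated clusters, 600 inliers, 50% outliers (300 points).
centers = [(-10, -10), (-10, 10), (0, 0), (10, -10), (10, 10)]
inliers, outliers = simulate(centers, n_inliers=600, outlier_ratio=0.5)
```

With 600 inliers, an outlier ratio of 50% yields 300 outliers, which matches the counts given in the caption of Fig. 1.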

In addition to the two proposed methods (FWD-EM and
WD-EM) we tested the following algorithms:

• GMM uses EM with the standard Gaussian mixture
model;

• GMM+U uses EM with a GMM and with an additional
uniform component [30];

• FM-uMST stands for the finite mixture of unrestricted
multivariate skew t-distribution algorithm of [8];

• IGMM stands for the infinite Gaussian mixture model;

• I²GMM stands for the infinite mixture of infinite Gaussian
mixtures;

• K-Means is the standard K-means algorithm;

• KK-Means is the kernel K-means algorithm of [31];

• NCUT is a spectral clustering algorithm;

• HAC is a hierarchical agglomerative clustering algorithm.

All the above algorithms need proper initialization. All
the mixture-based algorithms, WD-EM, FWD-EM, GMM,
GMM+U, FM-uMST, IGMM and I²GMM, start from the same
proportions, means, and covariances, which are estimated from
the set of clusters provided by K-means. The latter is randomly
initialized several times to find a good initialization. Furthermore,
algorithms WD-EM, FWD-EM, GMM, GMM+U
and FM-uMST are iterated until convergence, i.e., until the
log-likelihood difference between two consecutive iterations is less
than 1%, or are stopped after 400 iterations.
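The stopping rule can be sketched as a generic driver loop. This is only an illustration: `step` is a hypothetical callable that performs one EM iteration and returns the current log-likelihood, and the "1%" criterion is interpreted here as a relative change with respect to the previous log-likelihood, which is an assumption.

```python
def run_until_converged(step, rel_tol=0.01, max_iter=400):
    """Call step() until the relative log-likelihood change drops below
    rel_tol or max_iter iterations have been performed.
    Returns the last log-likelihood and the number of iterations."""
    prev = step()
    for it in range(2, max_iter + 1):
        cur = step()
        if abs(cur - prev) < rel_tol * abs(prev):
            return cur, it
        prev = cur
    return prev, max_iter

# Toy usage: a precomputed sequence of log-likelihoods stands in for EM.
log_liks = iter([-100.0, -50.0, -49.8, -49.7])
final, n_iter = run_until_converged(lambda: next(log_liks))
```

In this toy run the loop stops at the third iteration, because the change from -50.0 to -49.8 is below 1% of the previous value.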

To quantitatively evaluate all the tested methods, we chose
to use the Davies-Bouldin (DB) index:

$$DB = \frac{1}{K} \sum_{k=1}^{K} R_k \qquad (39)$$

where $R_k = \max_{l, l \neq k} \{(S_k + S_l)/d_{kl}\}$,
$S_k = n_k^{-1} \sum_{x \in \mathcal{C}_k} \|x - \mu_k\|$ is the cluster scatter,
$n_k$ is the number of samples in cluster $k$, $\mu_k$ is the cluster center,
and $d_{kl} = \|\mu_k - \mu_l\|$. A low value of the DB index
means that the clusters are far from each other with respect
to their scatter, and therefore the discriminative power is
higher. Since the algorithms are randomly initialized, we
repeat each experiment 20 times and compute the mean and
standard deviation of the DB index for each experiment.
Table II summarizes the results obtained with the MNIST,
WAV, BCW, and Letter Recognition datasets. The proposed
WD-EM method yields the best results for the WAV and
BCW data, while the I²GMM method yields the best results
for the MNIST data. It is interesting to notice that the
non-parametric methods K-means, NCUT and HAC yield the
best and second best results for the Letter Recognition data.
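For reference, the DB index of (39) can be computed directly from a hard cluster assignment. The sketch below is a plain illustration of the formula, assuming at least two non-empty clusters with distinct centers; the function name `davies_bouldin` and the toy data are hypothetical.

```python
import math

def davies_bouldin(points, labels):
    """Davies-Bouldin index: mean over clusters of
    R_k = max_{l != k} (S_k + S_l) / d_kl."""
    clusters = {}
    for x, k in zip(points, labels):
        clusters.setdefault(k, []).append(x)
    centers, scatters = {}, {}
    for k, xs in clusters.items():
        # Cluster center mu_k (coordinate-wise mean).
        mu = tuple(sum(c) / len(xs) for c in zip(*xs))
        centers[k] = mu
        # Cluster scatter S_k: mean distance of the members to the center.
        scatters[k] = sum(math.dist(x, mu) for x in xs) / len(xs)
    total = 0.0
    for k in clusters:
        total += max(
            (scatters[k] + scatters[l]) / math.dist(centers[k], centers[l])
            for l in clusters if l != k
        )
    return total / len(clusters)

# Two compact clusters whose centers are 10 units apart.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
labels = [0, 0, 1, 1]
db = davies_bouldin(points, labels)
```

Moving the two clusters further apart while keeping their scatter fixed lowers the index, in line with the interpretation given above.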


For completeness we also provide the micro $F_1$ scores
obtained with the MNIST, WAV, BCW and Letter
Recognition datasets in Table III. Based on this classification
score, the proposed WD-EM method yields the best results
for the WAV and BCW data, while I²GMM yields the best
results for the Letter Recognition data, and the IGMM method
yields the best results for the MNIST data. This comparison
also shows that I²GMM, GMM and GMM+U yield similar
scores.
TABLE II: Results obtained with the MNIST, WAV, BCW, and Letter Recognition datasets. The clustering scores correspond
to the Davies-Bouldin (DB) index. The best results are shown in underlined bold, and the second best results are shown in
bold. The proposed method yields the best results for the WAV and BCW datasets, while I²GMM yields the best results for
the MNIST dataset. Interestingly, the non-parametric methods (K-means, HAC and Ncut) yield excellent results for Letter
Recognition.

Dataset             WD-EM        FWD-EM       GMM          GMM+U        FM-uMST      IGMM         I²GMM        K-Means      KK-Means     Ncut         HAC
MNIST               2.965(0.15)  3.104(0.21)  3.291(0.14)  3.245(0.09)  2.443(0.00)  3.555(0.06)  2.430(0.14)  2.986(0.01)  2.980(0.02)  4.760(0.08)  3.178(0.00)
WAV                 0.975(0.00)  1.019(0.00)  1.448(0.03)  1.026(0.04)  1.094(0.10)  1.028(0.02)  2.537(0.35)  1.020(0.00)  0.975(0.05)  2.781(0.06)  1.089(0.00)
BCW                 0.622(0.00)  0.687(0.00)  0.714(0.00)  0.689(0.00)  0.727(0.00)  0.719(0.00)  0.736(0.09)  0.659(0.00)  0.655(0.00)  0.838(0.00)  0.685(0.00)
Letter Recognition  1.690(0.00)  1.767(0.01)  2.064(0.06)  2.064(0.06)  1.837(0.00)  2.341(0.11)  1.724(0.03)  1.450(0.02)  1.504(0.03)  1.626(0.00)  1.626(0.00)

TABLE III: Micro $F_1$ scores obtained on the real data sets (MNIST, WAV, BCW and Letter Recognition). The number in
parentheses indicates the standard deviation of 20 repetitions. Based on this classification score, I²GMM yields the best result.

Data set            WD-EM        FWD-EM       GMM          GMM+U        FM-uMST      IGMM         I²GMM        K-Means      KK-Means     Ncut         HAC
MNIST               0.524(0.01)  0.455(0.01)  0.573(0.00)  0.549(0.01)  0.519(0.00)  0.689(0.02)  0.545(0.06)  0.497(0.02)  0.507(0.02)  0.402(0.00)  0.532(0.00)
WAV                 0.774(0.00)  0.534(0.00)  0.535(0.00)  0.552(0.00)  0.632(0.08)  0.543(0.01)  0.493(0.00)  0.521(0.00)  0.522(0.00)  0.387(0.00)  0.597(0.00)
BCW                 0.965(0.00)  0.907(0.00)  0.885(0.00)  0.915(0.00)  0.927(0.00)  0.914(0.00)  0.682(0.00)  0.907(0.00)  0.910(0.00)  0.859(0.00)  0.879(0.00)
Letter Recognition  0.315(0.01)  0.323(0.00)  0.423(0.00)  0.423(0.00)  0.379(0.00)  0.306(0.02)  0.466(0.01)  0.340(0.00)  0.343(0.01)  0.347(0.00)  0.347(0.00)


TABLE IV: DB scores obtained on the SIM-X dataset (best and second best).

Dataset         Outliers  WD-EM        FWD-EM       GMM          GMM+U        FM-uMST      IGMM         I²GMM        K-Means      KK-Means     Ncut         HAC
SIM-Easy        10%       0.229(0.01)  0.295(0.01)  0.295(0.01)  0.222(0.02)  0.307(0.02)  1.974(0.12)  0.500(0.16)  0.291(0.01)  0.330(0.07)  0.283(0.01)  0.266(0.00)
SIM-Easy        20%       0.266(0.02)  0.338(0.01)  0.342(0.01)  0.233(0.01)  0.349(0.02)  1.564(0.43)  0.626(0.28)  0.344(0.01)  0.420(0.10)  0.335(0.01)  0.330(0.01)
SIM-Easy        30%       0.330(0.01)  0.385(0.01)  0.384(0.02)  0.227(0.02)  0.501(0.04)  1.296(0.12)  0.570(0.27)  0.372(0.01)  0.381(0.03)  0.366(0.02)  0.376(0.01)
SIM-Easy        40%       0.358(0.01)  0.445(0.04)  0.453(0.05)  0.211(0.02)  0.585(0.06)  1.259(0.16)  0.534(0.21)  0.417(0.01)  0.411(0.01)  0.409(0.01)  0.401(0.01)
SIM-Easy        50%       0.380(0.01)  0.455(0.02)  0.459(0.02)  0.195(0.01)  0.568(0.05)  1.107(0.06)  0.626(0.21)  0.422(0.01)  0.439(0.03)  0.422(0.01)  0.438(0.01)
SIM-Unbalanced  10%       0.270(0.01)  0.954(0.72)  1.354(1.02)  0.277(0.01)  1.104(0.76)  1.844(0.29)  0.491(0.17)  0.405(0.02)  0.433(0.05)  0.402(0.02)  0.427(0.02)
SIM-Unbalanced  20%       0.329(0.03)  4.503(4.33)  3.003(1.85)  0.269(0.01)  1.181(0.44)  1.278(0.45)  0.591(0.13)  0.512(0.02)  0.515(0.03)  0.477(0.03)  0.529(0.02)
SIM-Unbalanced  30%       0.399(0.03)  3.502(3.09)  2.034(1.22)  0.252(0.03)  1.414(0.88)  1.272(0.35)  0.601(0.10)  0.548(0.03)  0.540(0.03)  0.531(0.02)  0.570(0.03)
SIM-Unbalanced  40%       0.534(0.13)  2.756(2.33)  2.097(1.15)  0.251(0.02)  1.650(0.94)  1.239(0.36)  0.615(0.05)  0.557(0.03)  0.567(0.02)  0.563(0.02)  0.597(0.02)
SIM-Unbalanced  50%       0.557(0.10)  2.400(1.44)  1.520(0.38)  0.268(0.01)  1.612(0.69)  1.144(0.36)  0.665(0.10)  0.580(0.03)  0.585(0.03)  0.583(0.03)  0.636(0.02)
SIM-Overlapped  10%       0.305(0.02)  0.693(0.31)  1.510(0.97)  0.307(0.02)  1.373(0.63)  2.168(0.20)  0.554(0.14)  0.395(0.03)  0.428(0.06)  0.385(0.01)  0.427(0.01)
SIM-Overlapped  20%       0.368(0.03)  1.562(0.45)  1.881(0.50)  0.293(0.01)  2.702(1.28)  1.837(0.37)  0.608(0.08)  0.467(0.02)  0.532(0.07)  0.440(0.02)  0.502(0.01)
SIM-Overlapped  30%       0.472(0.04)  1.825(0.55)  2.209(0.64)  0.294(0.03)  5.101(1.99)  1.568(0.61)  0.586(0.15)  0.532(0.02)  0.521(0.03)  0.508(0.01)  0.557(0.01)
SIM-Overlapped  40%       0.549(0.04)  2.372(0.54)  2.597(0.73)  0.322(0.01)  4.569(1.72)  1.320(0.40)  0.687(0.11)  0.546(0.02)  0.556(0.03)  0.541(0.03)  0.593(0.02)
SIM-Overlapped  50%       0.641(0.06)  2.269(0.44)  2.247(0.60)  0.298(0.02)  5.762(3.34)  1.174(0.25)  0.815(0.12)  0.563(0.03)  0.576(0.02)  0.560(0.03)  0.618(0.02)
SIM-Mixed       10%       0.282(0.01)  0.443(0.11)  0.448(0.11)  0.290(0.01)  0.951(0.35)  2.032(0.46)  0.414(0.12)  0.358(0.01)  0.418(0.06)  0.359(0.01)  0.355(0.01)
SIM-Mixed       20%       0.351(0.02)  0.857(0.52)  1.325(0.79)  0.286(0.01)  1.062(0.38)  1.782(0.44)  0.462(0.08)  0.413(0.02)  0.476(0.06)  0.409(0.01)  0.428(0.01)
SIM-Mixed       30%       0.396(0.02)  1.368(0.74)  1.524(0.64)  0.278(0.01)  1.693(0.56)  1.627(0.54)  0.483(0.07)  0.454(0.02)  0.464(0.04)  0.449(0.01)  0.468(0.01)
SIM-Mixed       40%       0.449(0.03)  1.100(0.61)  1.188(0.59)  0.277(0.02)  1.609(0.43)  1.456(0.34)  0.483(0.05)  0.478(0.02)  0.504(0.04)  0.478(0.01)  0.508(0.02)
SIM-Mixed       50%       0.492(0.03)  1.364(0.59)  1.513(0.67)  0.265(0.01)  1.972(0.86)  1.366(0.29)  0.562(0.04)  0.501(0.01)  0.515(0.02)  0.499(0.02)  0.546(0.02)


An interesting feature of the proposed weighted-data clustering
algorithms is their robustness in finding good clusters in
the presence of outliers. To illustrate this ability we ran a large
number of experiments by adding outliers, drawn from a uniform
distribution, to the four simulated datasets, e.g. Table IV
and Fig. 1. A comparison between WD-EM, FWD-EM, and
the state-of-the-art clustering techniques mentioned above, with
different percentages of outliers, is provided. As can be
easily observed in Table IV, GMM+U performs extremely
well in the presence of outliers, which is not surprising since
the simulated outliers are drawn from a uniform distribution.
Overall, the proposed WD-EM method is the second best
performing method. Notice the very good performance of the
Ncut method for the SIM-Overlapped data. Among all these
methods, only GMM+U and WD-EM offer the possibility to
characterize the outliers, using two very different strategies. The
GMM+U model simply pulls them into an outlier class based on
the posterior probabilities. The WD-EM algorithm iteratively
updates the posterior probabilities of the weights, and the final
posteriors allow a simple outlier detection mechanism to be
implemented. Another important remark is that WD-EM
systematically outperforms FWD-EM, which fully justifies the
proposed weighted-data model. Fig. 2 shows results of fitting
the mixture models to SIM-Mixed data drawn from a Gaussian
mixture and contaminated with 50% outliers drawn from a
uniform distribution. These plots show that GMM, IGMM, and
I²GMM find five components corresponding to data clusters
while they also fit a component onto the outliers, roughly
centered on the data set.











































(a) WD-EM (b) FWD-EM (c) GMM (d) GMM+U

Fig. 2: Results obtained by fitting mixture models to the SIM-Mixed data in the presence of 50% outliers (see Table IV).


IX. Audio-Visual Clustering 

In this section we illustrate the effectiveness of our method
in dealing with audio-visual data, which are heterogeneous,
i.e. gathered with different sensors, having
different noise statistics, and different sources of errors. The
challenges of clustering audio-visual data were enumerated
earlier in the paper. Prior to clustering one needs to represent audio
and visual observations in the same Euclidean space, e.g.
Fig. 3. Without loss of generality we adopt the sound-source
localization method of [35] that performs 2D direction of
arrival (DOA) estimation followed by mapping the estimated
sound-source direction onto the image plane: a DOA estimate
therefore corresponds to a pixel location in the image plane.
To find visual features, we use an upper-body detector
that provides an approximate localization of human heads,
followed by lip localization using facial landmark detection.
The rationale for combining upper-body detection with
facial landmark localization is that, altogether, this yields a
detection and localization algorithm that is much more robust
to head pose than the vast majority of face detection methods.

Let $\mathcal{A} = \{a_j\}_{j=1}^{n_a}$ and $\mathcal{V} = \{v_j\}_{j=1}^{n_v}$
denote the sets of auditory and visual observations, respectively.
To initialize the weight variables, we use (38) in the
following way. An auditory sample is given a high initial
weight if it has many visual samples as neighbors, i.e.
$w_{a_i} = \sum_{v_j \in \mathcal{V}} \exp(-d^2(a_i, v_j)/\sigma)$. Visual weights are initialized in
an analogous way, $w_{v_i} = \sum_{a_j \in \mathcal{A}} \exp(-d^2(v_i, a_j)/\sigma)$. As
illustrated below, this cross-modal weighting scheme favors
clusters composed of both auditory and visual observations.
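A minimal sketch of this cross-modal initialization is given here, under the assumption that both modalities are expressed as 2D pixel coordinates in the image plane. The function name `cross_modal_weights`, the value of `sigma` and the toy coordinates are hypothetical and only serve the illustration.

```python
import math

def cross_modal_weights(queries, others, sigma=100.0):
    """Weight of each query point: sum, over all points of the other
    modality, of exp(-squared distance / sigma)."""
    return [
        sum(
            math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / sigma)
            for y in others
        )
        for x in queries
    ]

# One audio observation close to the visual cluster, one far away from it.
audio = [(100.0, 100.0), (400.0, 300.0)]
visual = [(102.0, 98.0), (101.0, 103.0), (99.0, 100.0)]
w_audio = cross_modal_weights(audio, visual)   # audio weighted by visual data
w_visual = cross_modal_weights(visual, audio)  # visual weighted by audio data
```

The audio observation surrounded by visual observations gets a weight close to the number of visual points, whereas the isolated one gets a weight close to zero.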
We recorded three sequences: 

• The fake speaker (FS) sequence, e.g. first and second
rows of Fig. 4, consists of two persons facing the camera
and the microphones. While the person on the right
emits speech signals (counting from “one” to “ten”), the
person on the left performs fake lip, facial, and head
movements as if he were speaking.

• The moving speakers (MS) sequence, e.g. third and
fourth rows of Fig. 4, consists of two persons who move
around while always facing the cameras and







Fig. 3: Audio-visual data acquisition and alignment. Top: left- 
and right-microphone signals. A temporal segment of 0.4 s is 
outlined in red. Middle: Binaural spectrogram that corresponds 
to the outlined segment. This spectrogram is composed of 
50 binaural vectors, each one being associated with an audio 
frame (shown as a vertical rectangle). Bottom: video frames 
associated with a segment. A sound-source direction of arrival 
(DOA) is extracted from each binaural vector and mapped onto 
the image plane, hence each green dot in the image plane 
corresponds to a DOA. 
microphones. The persons take speech turns but there is
a short overlap between the two auditory signals.

• The cocktail party (CP) sequence, e.g. fifth and sixth
rows of Fig. 4, consists of four persons engaged in an
informal dialog. The persons wander around and turn
their heads towards the active speaker; occasionally two
persons speak simultaneously. Moreover, the speakers do
not always face the camera, hence face and lip detection
and localization are unreliable.

The visual data are gathered with a single camera and the
auditory data are gathered with two microphones plugged into
the ears of an acoustic dummy head, referred to as binaural
audition. The visual data are recorded at 25 video frames per
second (FPS). The auditory data are gathered and processed
in the following way. First, the short-time Fourier transform
(STFT) is applied to the left- and right-microphone signals,
which are sampled at 48 kHz. Second, the left and right
spectrograms thus obtained are combined to yield a binaural
spectrogram from which a sound-source DOA is estimated. A
spectrogram composed of 512 frequency bins is obtained by
applying the STFT over a sliding window of width 0.064 s and
shifted along the signal with 0.008 s hops. An audio frame,
or 512 frequency bins, is associated with each window, hence
there are 125 audio frames per second (with 0.056 s overlap
between consecutive frames). Both the visual and audio frames
are further grouped into temporal segments of width 0.4 s,
hence there are 10 visual frames and 50 audio frames in each
segment.
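The frame and segment counts quoted above follow from simple arithmetic on the sampling rate, window width, hop size and segment width. The short check below reproduces them; the variable names are chosen for the illustration only.

```python
fs = 48_000          # audio sampling rate (Hz)
window_s = 0.064     # STFT window width (s)
hop_s = 0.008        # hop between consecutive windows (s)
segment_s = 0.4      # temporal segment width (s)
video_fps = 25       # video frame rate (frames per second)

window_samples = round(fs * window_s)                 # samples per window
hop_samples = round(fs * hop_s)                       # samples per hop
audio_frames_per_second = round(1 / hop_s)            # one frame per hop
overlap_s = window_s - hop_s                          # shared by consecutive windows
audio_frames_per_segment = round(segment_s / hop_s)
video_frames_per_segment = round(segment_s * video_fps)
```

Consecutive windows share 0.064 - 0.008 = 0.056 s of signal, and each 0.4 s segment holds 50 audio frames and 10 video frames.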

As already mentioned, we follow the method of [35] to
extract a sound-source DOA from each audio frame. In order
to increase the robustness of audio localization, a voice activity
detector (VAD) is first applied to each frame, such that not
all the frames have DOA estimates associated with them. On
average there are 40 audio DOA observations per segment.
The FS sequence contains 28 segments, the MS sequence
contains 43 segments, while the CP sequence contains 115
segments. The left-hand column of Fig. 4 shows the central frame
of a segment with all the visual features (blue) and auditory
features (green) available within that segment.

We tested the proposed WD-EM algorithm on these audio-visual
data as well as the GMM+U [30] and FM-uMST [8]
algorithms. We chose to compare our method with these two
methods for the following reasons. Firstly, all three methods
are based on finite mixtures and hence they can use a model
selection criterion to estimate the number of components in
the mixture that best approximates clusters in the data. This is
important since the number of persons and of active speakers
among these persons are not known in advance. Secondly, as
demonstrated in the previous section, these three methods yield
robust clustering in the presence of outliers.

WD-EM uses the MML criterion for model selection as
described earlier. We implemented a model selection
criterion based on BIC to optimally select the number of components
with GMM+U and FM-uMST. While each algorithm
yields an optimal number of components for each audio-visual
segment, not all of them contain a sufficient number of audio and


TABLE V: The correct detection rates (CDR) obtained with
the three methods for three scenarios: fake speaker (FS),
moving speakers (MS), and cocktail party (CP).

Scenario  # Segments  WD-EM    GMM+U [30]  FM-uMST [8]
FS        28          100.00%  100.00%     71.43%
MS        43          83.87%   61.90%      72.22%
CP        115         65.66%   52.48%      49.57%


visual observations, such that the component can be associated
with an active speaker. Therefore, we apply a simple two-step
strategy, firstly to decide whether a component is audio-visual,
audio-only, or visual-only, and secondly to select the best
audio-visual components. Let $n_v$ and $n_a$ be the total number
of visual and audio observations in a segment. We start by
assigning each observation to a component: let $n_a^k$ and $n_v^k$
be the number of audio and visual observations associated
with component $k$. Let $r_k = \min\{n_a^k, n_v^k\}/(n_a + n_v)$ measure
the audio-visual relevance of a component. If $r_k > s$ then
component $k$ corresponds to an active speaker, with $s$ being a
fixed threshold.
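The relevance test can be sketched as follows, assuming that every observation has already been assigned to a component and carries a modality tag. The function name `active_speaker_components`, the `'a'`/`'v'` tags and the threshold value `s = 0.1` are hypothetical; the text does not state the value of $s$ that was used.

```python
from collections import Counter

def active_speaker_components(labels, modalities, s):
    """labels[i] is the component of observation i and modalities[i] is
    'a' (audio) or 'v' (visual). Returns the set of components with
    relevance r_k > s, together with all relevance scores."""
    n_total = len(labels)  # n_a + n_v
    counts = Counter(zip(labels, modalities))
    relevance = {
        k: min(counts[(k, "a")], counts[(k, "v")]) / n_total
        for k in set(labels)
    }
    active = {k for k, r in relevance.items() if r > s}
    return active, relevance

# Component 0 is audio-visual, component 1 audio-only, component 2 visual-only.
labels = [0, 0, 0, 0, 1, 1, 2, 2, 2]
modalities = ["a", "a", "v", "v", "a", "a", "v", "v", "v"]
active, relevance = active_speaker_components(labels, modalities, s=0.1)
```

Single-modality components get a relevance of zero and are therefore never selected, whatever the positive threshold.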

Fig. 4 shows examples of applying the WD-EM, GMM+U
and FM-uMST algorithms to the three sequences. One may
notice that, while the visual observations (blue) are very
accurate and form small lumps around the moving lips of
a speaker (or of a fake speaker), audio observations (green)
are very noisy and have different statistics; this is due to the
presence of reverberations (the ceiling in particular) and of
other sound sources, such as computer fans. The ground-truth
active speaker is shown with a yellow frame. The data clusters
obtained by the three methods are shown with red ellipses. A
blue disk around a cluster center designates an audio-visual
cluster. Altogether, one may notice that the proposed method
outperforms the two other methods. An interesting feature
of WD-EM is that the weights give more importance to the
accurate visual data (because of the low-variance groups of
observations available with these data) and hence the
audio-visual cluster centers are pulled towards the visual data (lip
locations in these examples).

To further quantify the performance of the three methods,
we carefully annotated the data. For each segment, we identified
the active speaker and we precisely located the speaker’s
lips. Let $x_g$ be the ground-truth lip location. We assign $x_g$ to
a component by computing the maximum responsibility
of $x_g$. When $x_g$ is assigned to an audio-visual cluster, an
active speaker is said to be correctly detected if the posterior
probability of $x_g$ is equal to or greater than $1/K$, where $K$ is
the number of components. Table V summarizes the results
obtained with the three methods.
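The detection rule can be illustrated with the small sketch below. It takes the responsibilities of the ground-truth location as input rather than computing them from a fitted mixture, and it treats the correct detection rate as the percentage of evaluated segments with a correct detection; both simplifications, as well as the function names, are assumptions made for the example.

```python
def is_correct_detection(responsibilities, audio_visual):
    """responsibilities[k]: posterior probability of component k at x_g.
    audio_visual: set of component indices labeled as audio-visual."""
    K = len(responsibilities)
    # Assign x_g to the component with maximum responsibility.
    k_star = max(range(K), key=lambda k: responsibilities[k])
    return k_star in audio_visual and responsibilities[k_star] >= 1.0 / K

def correct_detection_rate(outcomes):
    """Percentage of segments with a correct detection."""
    return 100.0 * sum(outcomes) / len(outcomes)

# Toy usage with three components and four segments.
outcomes = [
    is_correct_detection([0.7, 0.2, 0.1], audio_visual={0}),
    is_correct_detection([0.7, 0.2, 0.1], audio_visual={1}),
    is_correct_detection([0.1, 0.5, 0.4], audio_visual={1, 2}),
    is_correct_detection([0.3, 0.3, 0.4], audio_visual={2}),
]
cdr = correct_detection_rate(outcomes)
```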

Columns: Data & Active speaker, WD-EM, GMM+U, FM-uMST. Row labels: FS-A, MS-A, MS-B, CP-A, CP-B.

Fig. 4: Results obtained on the fake speaker (FS), moving speaker (MS) and cocktail party (CP) sequences. The first column
shows the audio (green) and visual (blue) observations, as well as a yellow bounding box that shows the ground-truth active
speaker. The second, third and fourth columns show the mixture components obtained with the WD-EM, GMM+U and FM-uMST
methods, respectively. The blue disks mark components that correspond to correct detections of active speakers, namely
whenever there is an overlap between a component and the ground-truth bounding box.

X. Conclusions

We presented a weighted-data Gaussian mixture model. We
derived a maximum-likelihood formulation and we devised
two EM algorithms, one that uses fixed weights (FWD-EM)
and another one with weights modeled as random variables
(WD-EM). While the first algorithm appears to be a
straightforward generalization of standard EM for Gaussian
mixtures, the second one has a more complex structure. We
showed that the expectation and maximization steps of the
proposed WD-EM admit closed-form solutions and hence the
algorithm is extremely efficient. Moreover, WD-EM performs
much better than FWD-EM, which fully justifies the proposed
generative probabilistic model for the weights. We extended
an existing MML-based model selection criterion to
the weighted-data Gaussian mixture model and we proposed
an algorithm that finds an optimal number of components
in the data. Interestingly, the WD-EM algorithm compares
favorably with several state-of-the-art parametric and
non-parametric clustering methods; it performs particularly well
in the presence of a large number of outliers, e.g. up to 50%
of outliers. Hence, the proposed formulation belongs to the
robust category of clustering methods.

We also applied WD-EM to the problem of clustering
heterogeneous/multimodal data sets, such as audio-visual data.
We briefly described the audio-visual fusion problem and
how it may be cast into a challenging audio-visual clustering
problem, e.g. how to associate human faces with speech
signals and how to detect and localize active speakers in
complex audio-visual scenes. We showed that the proposed
algorithm yields better audio-visual clustering results than two
other finite-mixture models, for two reasons: (i) it is
very robust to noise and to outliers and (ii) it allows a
cross-modal weighting scheme. Although not implemented in this
paper, the proposed model has many other interesting features
when dealing with multimodal data; it makes it possible to balance the
importance of the modalities, to emphasize one modality, or to
use any prior information that might be available, for example
by giving high weight priors to visual data corresponding to
face/lip localization.